Generating Chinese Named Entity Data from a Parallel Corpus
Abstract
Annotating training corpora for Named Entity Recognition (NER) is costly but necessary for supervised NER systems. This paper presents an approach to generating large-scale Chinese NER training data from an English-Chinese parallel corpus aligned at the discourse level. The difficulty of NER varies across languages because of their distinct features; for example, English NER systems usually perform better than Chinese ones on average. In our method, we first apply a high-performance NER system to one side of a bilingual corpus. We then project the NE labels onto the other side according to the word-level alignment. Finally, we select high-quality labeled sentences using different strategies and generate an NER training corpus. In our experiments, we generate a Chinese NER corpus of 167,100 sentences from an English-Chinese parallel corpus. The system trained on the automatically generated corpus attains results comparable to those of the system trained on the manually annotated corpus. Further experiments show that NER performance improves significantly on two different evaluation sets when the generated training data is added to the manually labeled data.
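The projection and selection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `project_labels` and `fully_aligned`, the BIO tag scheme, the toy sentence pair, and the "every entity token must be aligned" filter are all assumptions made for the example, since the abstract does not specify the actual selection strategies.

```python
from typing import List, Tuple


def project_labels(src_labels: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Project BIO labels from source tokens onto aligned target tokens.

    alignment holds (source_index, target_index) pairs (hypothetical format).
    """
    # Step 1: copy the bare entity type (PER, LOC, ...) across each link.
    tgt_types = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        label = src_labels[src_idx]
        if label != "O":
            tgt_types[tgt_idx] = label.split("-", 1)[1]

    # Step 2: rebuild BIO tags from contiguous runs on the target side.
    # Simplification: adjacent entities of the same type are merged.
    projected, prev = [], "O"
    for ent_type in tgt_types:
        if ent_type == "O":
            projected.append("O")
        elif ent_type == prev:
            projected.append("I-" + ent_type)
        else:
            projected.append("B-" + ent_type)
        prev = ent_type
    return projected


def fully_aligned(src_labels: List[str],
                  alignment: List[Tuple[int, int]]) -> bool:
    """Assumed selection rule: keep a sentence only if every source
    entity token has at least one alignment link."""
    aligned_src = {s for s, _ in alignment}
    return all(i in aligned_src
               for i, label in enumerate(src_labels) if label != "O")


# Toy example: "Obama visited Beijing" / "奥巴马 访问 了 北京"
en_labels = ["B-PER", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (2, 3)]
zh_tags = project_labels(en_labels, alignment, tgt_len=4)
print(zh_tags)
```

In a real pipeline the source labels would come from an English NER system and the links from a word aligner; both are stubbed out here with hand-written values.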
Similar resources
Finding and Typing New Named Entities in Tibetan from Chinese-Tibetan Parallel Corpora
Currently there is much interest in the automatic acquisition of entities, with the goal of Named Entity Recognition (NER). However, previous work has focused primarily on major languages that have large, structured, semantically rich knowledge bases and large corpora annotated with NER tags. In this paper, we describe a method for Chinese-Tibetan bilingual named entity recognition ...
Multi-feature Based Chinese-English Named Entity Extraction from Comparable Corpora
Bilingual Named Entity Extraction is important for cross-language information processing tasks such as machine translation (MT) and cross-lingual information retrieval (CLIR). Much previous work has extracted bilingual Named Entities from parallel corpora. Here we propose a multi-feature based method to extract bilingual Named Entities from comparable corpora. We first recognize the Chinese and Eng...
A'lam Corpus: A Standard Named Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. An NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...
Toward a Name Entity Aligned Bilingual Corpus
This paper describes a co-training framework in which, through named entity aligned bilingual text, named entity taggers can complement and improve each other via an iterative process. This co-training approach allows us to 1) apply our method to not only parallel but also comparable text, greatly extending the applicability of the approach; and to 2) adapt named entity taggers to new domains; ...
Using Word Embeddings to Translate Named Entities
In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on compara...